Random Forest

Kristen Monaco, Praya Cheekapara, Raymond Fleming, Teng Ma


Code
library(readr)    # read_csv()
library(ggplot2)  # ggplot() and friends

data <- read_csv("All_threat_data.csv")

ggplot(data, aes(x = factor(Status), fill = factor(Status))) + 
  geom_bar(show.legend = FALSE) +
  scale_fill_brewer(palette = "Paired") +
  labs(title = "Barplot of Status", 
       x = "Status", 
       y = "Frequency") +
  theme_minimal() +
  theme(text = element_text(size = 12),
        plot.title = element_text(hjust = 0.5),
        axis.title = element_text(size = 14, face = "bold"),
        axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid.major = element_line(color = "grey80"),
        panel.grid.minor = element_blank())

Random Forest Overview

  • Ensemble machine learning method in which a large number of decision trees vote to produce a classification
  • Benefits compared to a single decision tree:
    • Able to function with incomplete data
    • Lower likelihood of overfitting
    • Improved prediction accuracy

Bootstrap Sampling (Bagging)

  • Each decision tree is trained on a random sample of the original dataset
    • Sampling is performed with replacement (a bootstrap sample)
    • Training each tree on a different subset reduces the probability of an overfit model
    • Rows left out of a given tree's sample (the "out-of-bag" rows) provide a built-in way to estimate performance
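The bagging step above can be sketched in base R; `df` here is a small hypothetical stand-in for the real dataset:

```r
set.seed(42)
df <- data.frame(id = 1:10, x = rnorm(10))  # toy stand-in dataset

# Draw a bootstrap sample: same size as the data, with replacement
boot_idx <- sample(nrow(df), size = nrow(df), replace = TRUE)
boot_sample <- df[boot_idx, ]

# Rows never drawn are the "out-of-bag" rows for this tree
oob_rows <- setdiff(seq_len(nrow(df)), unique(boot_idx))
```

Each tree in the forest repeats this draw independently, so every tree sees a slightly different dataset.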

Random Feature Selection

  • A random subset of features is considered at each node during training
    • Feature-importance information may be recorded and applied in later iterations
    • Even with automated random feature selection, feature selection and engineering prior to training may improve performance
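A minimal sketch of the per-node draw; the feature names are hypothetical stand-ins for this dataset's columns, and `floor(sqrt(p))` is the usual default number of candidates for classification:

```r
# Hypothetical feature pool (names mirror the dataset's columns)
features <- c("LF", "GF", "Biomes", "Range", "Habitat_loss")

# Typical default for classification: sqrt of the number of features
mtry <- floor(sqrt(length(features)))

# At each node, only this random subset is considered for the split
candidate_features <- sample(features, mtry)
```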

Code
library(caret)  # trainControl(), train(), varImp()

ctrl <- trainControl(method = "cv", number = 10)

bagged_cv <- train(
  Group ~ LF + GF + Biomes + Range + Habitat_degradation +
    Habitat_loss + IAS + Other + Unknown + Over_exploitation,
  data       = species_train,
  method     = "treebag",
  trControl  = ctrl,
  importance = TRUE)

plot(varImp(bagged_cv), 10)

Cross Validation

  • Validates the performance of the model
    • Resampling method similar to bootstrapping, but without replacement
    • Approximates how well the model generalizes to unseen data
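The fold assignment that `trainControl(method = "cv", number = 10)` performs internally can be sketched as follows; `n` and `k` are illustrative values:

```r
set.seed(7)
n <- 50   # number of training rows (illustrative)
k <- 10   # number of folds

# Assign each row to exactly one fold -- sampling without replacement
fold <- sample(rep(1:k, length.out = n))

# Each fold is held out once while the other k - 1 folds train the model
```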

Code
library(rpart)       # rpart()
library(rpart.plot)  # rpart.plot()

m3 <- rpart(
  formula = Group ~ LF + GF + Biomes + Range +
    Habitat_degradation + Habitat_loss + IAS +
    Other + Unknown + Over_exploitation,
  data   = species_train,
  method = "class"  # classification tree; "anova" would fit a regression tree
)
rpart.plot(m3)

Prediction

  • Each trained decision tree produces its own prediction
    • Decision trees are independent, and were trained on different subsets of both data and features

Ensemble Voting

  • The results from each decision tree are combined into a voting classifier
    • The mode of the classification results will be the final prediction
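Majority voting can be sketched in one line of base R; `tree_preds` is a hypothetical vector of per-tree class predictions:

```r
# Hypothetical predictions, one per decision tree in the forest
tree_preds <- c("Threatened", "Threatened", "Not_threatened", "Threatened")

# The mode (most frequent class) is the ensemble's final prediction
majority_vote <- names(which.max(table(tree_preds)))
```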

Dataset

  • South African Red List
    • Data about plants with their habitat, traits, distribution, and factors influencing their current threatened/extinct status
  • Purpose
    • Predict whether or not an unknown plant is threatened based on the above characteristics

Visuals 1

  • Distribution Range

Code
ggplot(data = data, aes(x = Status, y = Range, fill = Status)) +
  geom_boxplot() +
  theme_bw() +
  # coord_cartesian() zooms without dropping rows; ylim(0, 100000) would
  # silently exclude large-Range rows from the boxplot statistics
  coord_cartesian(ylim = c(0, 100000))

Visuals 2

  • Cramer’s V association, with Range binned into 20 categories
    • The target feature Group is most associated with Range, Family, Habitat Loss, Biome, and GF
    • The most associated features will likely be the most important features during model training
    • Collinearity does not appear to be present, though further checks could confirm this

Code
library(dplyr)  # %>%, mutate(), ntile()

# Bin Range into 20 quantile groups to make it categorical
corrDFRange <- corrDFRange %>% mutate(Range = ntile(Range, n = 20))
corrplot::corrplot(DescTools::PairApply(corrDFRange, DescTools::CramerV), type = "lower")

Analysis

  • Five separate random forest models were created, each using a different method of normalization

Data Preparation

  • Preprocessing
    • Encode categorical features as numeric or factor features
    • Split the data into training and test sets, stratifying to avoid class imbalance

Preprocessing

  • Class Imbalance
    • Resample smaller classes so that class sizes are approximately equal
    • Training on an imbalanced dataset biases predictions toward the larger class
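One way to implement the resampling, sketched with a toy `train` data frame standing in for `species_train` (minority classes are upsampled with replacement):

```r
set.seed(1)
train <- data.frame(Status = c(rep("common", 8), rep("rare", 2)))  # toy data

# Row indices of each class, and the size of the largest class
classes <- split(seq_len(nrow(train)), train$Status)
n_max <- max(lengths(classes))

# Resample every class up to n_max rows (with replacement where needed)
balanced_idx <- unlist(lapply(classes, function(i)
  sample(i, n_max, replace = length(i) < n_max)))
balanced <- train[balanced_idx, , drop = FALSE]
```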

Normalization
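The five specific methods compared are not listed here; as one common illustrative choice, min-max scaling maps a numeric feature onto [0, 1]:

```r
# Min-max scaling: map a numeric feature onto [0, 1]
# (one common normalization method; others include z-score standardization)
range_scale <- function(x) (x - min(x)) / (max(x) - min(x))

scaled <- range_scale(c(10, 20, 30))
```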

Prediction

  • Combine results into a vector
  • Identify the most frequently predicted class
  • Iterate over entire test set, storing results
  • Generate a confusion matrix and calculate the sensitivity and precision for each category
  • Iterate after tuning if necessary
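The evaluation steps above can be sketched with base R; `truth` and `preds` are toy stand-ins for the test-set labels and the ensemble's predictions:

```r
truth <- factor(c("T", "T", "N", "N", "T"))  # toy test-set labels
preds <- factor(c("T", "N", "N", "N", "T"))  # toy ensemble predictions

# Confusion matrix: rows are predictions, columns are the truth
cm <- table(Predicted = preds, Truth = truth)

# Per-class sensitivity (recall) and precision from the matrix
sensitivity <- diag(cm) / colSums(cm)
precision   <- diag(cm) / rowSums(cm)
```

caret's `confusionMatrix()` computes the same quantities directly.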

Results

  • Range was the strongest predictor of extinction risk
  • Habitat loss was the second strongest predictor